Conversation

@tvegas1 (Contributor) commented Oct 3, 2025

What?

Prefer interfaces that are closer to the memory being sent. In some cases this takes us from 15GB/s to 357GB/s.

Why?

Some systems have HCAs and a GPU under the same PCI bridge. In that case, although all interfaces have the same nominal bandwidth, protocol selection should prefer the sibling interfaces. This is particularly true when using multiple GPUs with pairwise traffic (GPUx to remote GPUx), where traffic should remain on the sibling interfaces.

How?

The code fetches the interfaces' bandwidth and checks whether it needs to be further restricted due to system topology (three possible values: infinite, 17GB/s, or 0.2GB/s). With slower interfaces (11GB/s) this does not work, and we cannot tell apart interfaces that sit under the same PCI bridge from those that do not.

The proposal is to boost the bandwidth in that case, but ideally bandwidth, latency, distance, and even partitioning of the traffic should be taken into account. We could also have protocol selection not rely only on raw bandwidth.

    if ((device_sys_dev != UCS_SYS_DEVICE_ID_UNKNOWN) &&
        (sys_dev != device_sys_dev) &&
        ucs_topo_is_pci_bridge(device_sys_dev, sys_dev)) {
        /* Interface shares a PCI bridge with the memory's device:
         * report a higher bandwidth so protocol selection prefers it */
        tl_perf->bandwidth *= 1.2;
    }
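
For illustration only, here is a minimal standalone sketch of the effect on two 11GB/s HCAs; the types and the is_pci_bridge_sibling helper are hypothetical stand-ins, not the actual UCX topology API:

    #include <stdio.h>

    /* Hypothetical stand-ins for the system-device id and topology
     * query used in the snippet above. */
    typedef int sys_dev_t;
    #define SYS_DEVICE_ID_UNKNOWN (-1)

    static int is_pci_bridge_sibling(sys_dev_t a, sys_dev_t b)
    {
        /* Pretend devices 0 and 1 sit under the same PCI bridge. */
        return (a == 0 && b == 1) || (a == 1 && b == 0);
    }

    /* Proposed boost: an interface sharing a PCI bridge with the
     * memory's device reports 20% more bandwidth, so selection (which
     * today looks only at raw bandwidth) picks it first. */
    static double effective_bw(double raw_bw, sys_dev_t mem_dev,
                               sys_dev_t iface_dev)
    {
        if ((mem_dev != SYS_DEVICE_ID_UNKNOWN) &&
            (mem_dev != iface_dev) &&
            is_pci_bridge_sibling(mem_dev, iface_dev)) {
            return raw_bw * 1.2;
        }
        return raw_bw;
    }

    int main(void)
    {
        sys_dev_t gpu = 0;   /* memory lives on this device */
        double raw   = 11.0; /* both HCAs report 11GB/s     */

        printf("sibling HCA: %.1f GB/s\n", effective_bw(raw, gpu, 1)); /* 13.2 */
        printf("far HCA    : %.1f GB/s\n", effective_bw(raw, gpu, 2)); /* 11.0 */
        return 0;
    }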

Contributor

I think it's ok as a quickfix

I agree about taking multiple factors into account when choosing a protocol. This is actually implemented in protocol variants: #10778
The idea is to select lanes by score (not just BW), where both BW and latency contribute. With this approach, a better way might be to decrease the system latency for an iface on the same PCI bridge (or increase it if not).
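
To make the score idea concrete, a minimal sketch of latency-based adjustment (the field names, the 0.5 latency factor, and the score formula are assumptions for illustration, not the protocol-variants implementation from #10778):

    #include <stdio.h>

    /* Hypothetical per-interface performance estimate. */
    typedef struct {
        const char *name;
        double      bandwidth;   /* bytes/sec                               */
        double      sys_latency; /* extra latency from system topology, sec */
        int         same_bridge; /* shares a PCI bridge with the memory     */
    } iface_perf_t;

    /* One possible score for a given message size: estimated transfer
     * time, lower is better. Same-bridge interfaces get their system
     * latency reduced instead of their bandwidth boosted. */
    static double iface_score(const iface_perf_t *p, double msg_size)
    {
        double lat = p->same_bridge ? p->sys_latency * 0.5 : p->sys_latency;
        return lat + msg_size / p->bandwidth;
    }

    int main(void)
    {
        iface_perf_t ifaces[] = {
            {"mlx5_0 (sibling)", 11e9, 2e-6, 1},
            {"mlx5_1 (far)",     11e9, 2e-6, 0},
        };
        double msg = 65536.0;

        for (int i = 0; i < 2; i++) {
            printf("%-18s score=%.3g sec\n", ifaces[i].name,
                   iface_score(&ifaces[i], msg));
        }
        return 0;
    }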
